ABSTRACT


Big Data has recently become one of the most important new factors in
business. Organizations need strategies to manage large volumes of
structured, unstructured and semi-structured data. It is challenging to
analyze data at such scale, to extract meaning from it and to handle
uncertain outcomes. Almost all big data sets are dirty, i.e. they may contain
inaccuracies, missing data, miscoding and other issues that weaken big data
analytics. One of the biggest challenges in big data analytics is to discover
and repair dirty data; failure to do so can lead to inaccurate analytics and
unpredictable conclusions. Data cleaning is therefore an essential part of
managing and analyzing data. This survey paper first examines the data
quality problems that may occur in big data processing, to make clear why an
organization requires data cleaning, and then presents data quality criteria
(the dimensions used to indicate data quality). Next, the cleaning tools
available in the market are summarized, and the challenges of cleaning big
data that arise from the nature of the data are discussed. Finally, machine
learning algorithms that can be used to analyze data, make predictions and
clean data automatically are reviewed.

1. INTRODUCTION

In 2016, IBM estimated that around 2.5 quintillion bytes of data were being produced each day, and
that 90% of all existing data had been created in the preceding two years alone [1]. This big data is usually
created by devices such as sensors and by the new technologies evolving today, and the rate of data growth
will likely accelerate further. Cisco, meanwhile, forecast that by 2020 the volume of worldwide traffic
crossing the Internet and IP WAN networks may reach 2.3 ZB per year [2].

The bulky and heterogeneous nature of big data requires investigation using big data analytics. Big
data analytics helps to discover hidden patterns, unknown relationships, current market trends, consumer
preferences and other aspects of data that can help institutes and companies make better-informed and
faster business decisions.

By now, most well-known companies have realized the need to implement big data analytics in
their systems to deliver better products and services. Using big data capabilities, any company can improve
the outcomes of its products and services and raise productivity by obtaining meaningful insights that move
its work forward. Different tools are available in the market to handle big data, but these tools address only
a few issues [3]. They are not usually integrated with data quality management; as a result, the market for
data quality tools is estimated to grow from USD 610.2 million in 2017 to USD 1,376.7 million by 2022, at a
compound annual growth rate (CAGR) of 17.7%. The base year considered for this report is 2016 and the
forecast period is 2017 to 2022 [4]. Beyond using big data capabilities, an organization needs to collect
values that are free of mistakes, gaps and errors, yet this requirement is very often neglected. Data that
contains such mistakes, incomplete values or errors is usually known as dirty data, and cleaning it can be
challenging for companies that want better results. Cleaning data manually requires experience, and humans
tend to make mistakes. Machine learning is currently adopted in different areas to process tasks
automatically, such as [5, 6]. Because machine learning can help automate such tasks, it is possible to clean
dirty data by training classification models.


2. BIG DATA ANALYTICS 
The general procedure for obtaining insights from Big Data can be broken down into five main
stages [7]-[9], as shown in Figure 1.

Figure 1. Processes for extracting insights from big data


Data Acquisition: Timeliness is one of the important requirements when loading data [10]. The
fundamental characteristics of Big Data, with its exponential rate of growth, raise exceptional issues in
Big Data engineering, such as data acquisition and storage [7].

Data Mining and Cleansing: The most essential stage of processing big data is to implement a
method that extracts the necessary data from the loaded unstructured Big Data and arranges it coherently in
a standard, organized form that is easy to recognize. The data cleaning process helps to clean dirty data.

Data Aggregation and Integration: The cleaned data needs to be aggregated for processing, by
gathering it and expressing it in summary form [11], [12]. This is followed by data integration, which
organizes data from disparate sources through a combination of technical and business methods to obtain
meaningful and valuable results [12].

Data Analysis and Modelling: From the viewpoint of Big Data, the goal is to produce business
value through the analysis of data, which may vary according to technique and data form. Meaningful
reports are constructed and investigated to help the business make better and faster decisions.

Data Interpretation: Presenting data in a form that users can understand, i.e. presenting the
analysis and modelling results so that decisions can be made by interpreting the outcomes and extracting
knowledge. Data interpretation queries are grouped together and refer to the same table, diagram, graph or
other data presentation option.


3. DATA QUALITY PROBLEMS 

The data cleaning process gets more complex when data comes from heterogeneous sources. Here,
data quality problems have to be solved by data cleaning and data transformation. Despite the various
viewpoints on the effect of data quality, in the end all such problems are likely to result in economic costs
for organizations. Surveys of real cases illustrate the economic cost of dirty data: a 2014 survey found that
bad data costs organizations around $13.3 million annually and costs the US economy $3 trillion per year.
Another organization, the U.S. Postal Service, recognizes the cost of bad data: in 2013, an estimated 6.8
billion pieces of mail could not be delivered to the stated address, which racked up $1.5 billion in handling
costs [13].

By some estimates, the problem of dirty data in organizations and companies has already reached
epidemic proportions. The issue is equally prevalent, and potentially even more frightening, in health care
and other organizations [14]. In the telecommunication industry, for instance, dirty data has numerous costs.
First and foremost, Experian estimates an average business loss of 12% due to inaccurate records, which
cause reduced productivity, wasted resources and, significantly, missed opportunities for cross-channel
marketing. The Experian study also highlights that approximately one-third of respondents think they waste
10% or more of their marketing budget because of results obtained from inaccurate data. Experian reports
that 25% of survey participants say their organization does not currently measure data accuracy; this figure
rises to 33% in telecoms and utilities companies and to 36% in government organizations [15].

These measurements are internal to organizations; looking at external matters such as marketing,
marketers struggle with dirty data as well. According to BizReport.com, “...marketers are generating a large
portion of poor-quality leads, including those with improper formatting and even inaccuracies. Bad prospect 
information can have negative consequences, including wasted media investment, squandered resources, and 
poor customer experience, which marketers simply can’t afford.” [16] 

In the medical field, errors can kill patients or cause long-lasting harm to a patient's health.
In 1999, the Institute of Medicine estimated [17], for instance, that between 44,000 and 98,000 people lost
their lives each year due to medical errors in hospitals alone, at a cost of $17 to $29 billion annually in
healthcare costs. Beyond health, dirty data can also raise privacy issues for patients.


4. DATA QUALITY CRITERIA 

Data quality is generally described as the capability of data to satisfy stated and implied needs when 
used under specified conditions [18]. Data accuracy, completeness and consistency are the most popular
dimensions for addressing data quality [19], [20], beside other dimensions such as accessibility, consistent
representation, timeliness, understandability, relevancy, etc. [19]. Moreover, data quality is a combination
of data content and form: data content must contain accurate information, and data form must be collected
and visualized in a way that makes the data usable. Content and form are significant considerations for
reducing data mistakes, as they make clear that the task of repairing dirty data requires more than simply
providing correct data.

Likewise, when developing a scheme to improve data quality it is essential to identify the primary
causes of dirty data. The causes are categorized into systematic and random errors. The basic sources of
systematic errors include programming mistakes, wrong data type definitions, incorrectly or badly defined
rules, violations of data collection rules, and poor training. The sources of random errors include keying
errors, illegible handwriting, data transcription problems, hardware failure or corruption, and mistaken or
intentionally misleading statements by the users who supply the primary data. Human data entry usually
introduces errors such as typos, missing values, literal values, heterogeneous ontologies (i.e. data of a
different nature), outdated values or violations of integrity constraints. Figure 2 shows an example in which
a few data quality problems can be identified in the Wireless Service Facility Permits (City of San
Francisco) database.

Figure 2. Data quality problems identified in an open dataset


Therefore, the most common dimensions of dirty data, including data duplication, are:

Inaccurate data refers to any field that contains a wrong value. A correct data value is accurate and
is represented in a consistent and unambiguous form.

Incomplete data is produced by data sets that are missing values. Such data is considered censored
when the number of values in a set is known but the values themselves are unknown, and truncated when
values in a set have been excluded.

Inconsistent data arises from data redundancy, i.e. the same data value is stored in different files,
possibly in different formats.

Duplicate data consists of entries for the same data that a system user has added multiple times.
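
As an illustration of these four dimensions, the following minimal Python sketch profiles a small record set and reports which records are inaccurate, incomplete, inconsistent or duplicated. The sample records, the field names, the plausible age range and the single allowed country code are all hypothetical assumptions chosen for the example; they do not come from the surveyed tools or datasets.

```python
# Hypothetical sample records; field names and rules are illustrative assumptions.
RECORDS = [
    {"id": 1, "name": "Alice", "age": 34, "country": "US"},
    {"id": 2, "name": "Bob", "age": -5, "country": "USA"},     # bad age, odd country code
    {"id": 3, "name": "Carol", "age": None, "country": "US"},  # missing age
    {"id": 1, "name": "Alice", "age": 34, "country": "US"},    # repeated entry
]


def profile(records, required, valid_ranges, allowed_values):
    """Return, per quality dimension, the indices of the offending records."""
    report = {"inaccurate": [], "incomplete": [], "inconsistent": [], "duplicate": []}
    seen = set()
    for index, record in enumerate(records):
        # Incomplete: a required field is missing or empty.
        if any(record.get(field) in (None, "") for field in required):
            report["incomplete"].append(index)
        # Inaccurate: a numeric value lies outside its plausible range.
        if any(record.get(field) is not None and not (low <= record[field] <= high)
               for field, (low, high) in valid_ranges.items()):
            report["inaccurate"].append(index)
        # Inconsistent: a value does not use the agreed representation.
        if any(record.get(field) is not None and record[field] not in allowed
               for field, allowed in allowed_values.items()):
            report["inconsistent"].append(index)
        # Duplicate: a record with identical content was already seen.
        key = tuple(sorted(record.items()))
        if key in seen:
            report["duplicate"].append(index)
        seen.add(key)
    return report


REPORT = profile(
    RECORDS,
    required=["name", "age", "country"],
    valid_ranges={"age": (0, 120)},
    allowed_values={"country": {"US"}},
)
```

A single record can fall under several dimensions at once, as the second sample record does, which is one reason the dimensions are usually measured separately.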


5. CLEANING TOOLS 

Different vendors provide data cleansing solutions, including Talend, IBM, SAS, Oracle and
Lavastorm Analytics. There are also some free tools that work on data transformation [21], [22], such as
OpenRefine, plyr and reshape2, although it is uncertain whether they can handle Big Data. Another
well-known class of tools is ETL tools, which provide complex data conversion techniques by merging and
repairing data [23]. A summary of some available commercial tools for managing data quality is presented in
Table 1, where the “Vendor” field names the company offering the tools, “Product” names the tool offered by
the vendor for managing data quality, and the “Website” column presents the website link of the company.
The “Like (s)” and “Dislike (s)” entries are taken from customer comments obtained from different websites,
like [24], [25].


Table 1. Comparison of Commercialized Data Quality Management Tools 


Vendor | Product | Website | Like (s) | Dislike (s)
Trifacta | Trifacta Data Wrangler | trifacta.com | [illegible] file and provides prescriptive methods | Formula based
Informatica | Data Quality Standard Edition; [illegible] and StrikeIron; Data Quality Advanced Edition; IDQ; Data Quality Governance Edition | [illegible] | Easy to interact with the provided interface to identify the functions; ease of data migration; completely on cloud | It requires SQL knowledge
SAP | Information Steward; Data Quality Management; SAP Data Services | go.sap.com | [illegible] Ability to recognise organization's needs and integrate | [illegible] Source code
Melissa Data | Data Quality Components for SSIS; Personator; Global Data Quality Suite; Global MatchUp; Melissa Listware | melissadata.com | APIs are easy and straightforward; able to use phonetics for address corrections; [illegible] contacts, standardize address; simple interface | No better documentation; performance is slow for real-time queries; no Mac OS integration
BDNA | BDNA Technopedia; BDNA Normalize | bdna.com | [illegible] date; adopt maturing technologies with manageable risk | Need to point out stale data, it will not refresh for [illegible]
SAS | Data Management; SAS Data Quality Desktop | sas.com | The learning curve is manageable | Needs training and education to use; no command window
Experian | Capture, Clean and Enhance [illegible] tools; Experian Pandora; Experian Data Quality Platform | experian.com | Low cost and flexibility of use with various file formats |
Pitney Bowes | Spectrum Technology Platform; Code-1 Plus | pitneybowes.com | User interface is quite friendly and attractive; can create APIs without programming | Hard to integrate and handle large amounts of data
CRMfusion | DemandTools; CRMfusion PeopleImport | crmfusion.com | [illegible] in real time; standardize, cleanse and overall manipulate data | Unable to enter a custom SOQL (e.g. with a subquery) as the basis for the data pulled down
Oracle | Oracle Enterprise Data Quality | oracle.com | Profiling customers easily; great for batch-oriented processing | Emphasis on ETL instead of data context and management; not good for real-time processing
IBM | [illegible]; InfoSphere Information Analyzer; InfoSphere Information Server | ibm.com | The lineage integrates metadata from Cognos, DataStage, QualityStage and Oracle metadata |
Addressy | Addressy | addressy.com | Only a simple JavaScript snippet is required on the page; the rest of the configuration can be done via the control panel |


6. BIG DATA ANALYTICS DATA CLEANING CHALLENGES

Generally, the data gathered will not be in a form ready for analysis. For instance, consider data
obtained from a telecommunication storage system, consisting of feedback obtained from different agents and
structured data from routers. It is challenging to analyze such mixed, partly unstructured data. An
extraction procedure that retrieves the necessary data from the various sources and presents it in a
structured form suitable for analysis is required. Data cleaning is an essential part of data analysis, and a
challenging one too [26]. Researchers from the database research community have identified several
challenges in obtaining useful data from big data [27], [28]. Cleaning is challenging in every data analysis,
but with the variety and volume of big data it becomes even more pronounced. Data quality needs to be
assured for accurate and correct data visualization. To deal with this issue, organizations need to overcome
some common challenges:


6.1. Scalability 

Cleaning techniques need to scale with the rapidly increasing size of big data, which is quite
challenging. Existing approaches include blocking for duplicate detection [29], [30], identification and
linkage for data cleaning [30], cleaning data using sampling [31], and distributed data cleaning [32].
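
To make the scalability argument concrete, the sketch below illustrates blocking for duplicate detection in Python: records are first grouped by a cheap key, and the expensive pairwise comparison runs only inside each group instead of across all pairs. The blocking key (the first three letters of the name), the similarity threshold and the sample names are assumptions made for this example and are not taken from [29] or [30].

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(name):
    """Cheap grouping key (assumed here): first three letters, lowercased."""
    letters = "".join(ch for ch in name.lower() if ch.isalpha())
    return letters[:3]


def find_duplicates(names, threshold=0.85):
    """Return index pairs of likely duplicates, comparing only within blocks."""
    blocks = defaultdict(list)
    for index, name in enumerate(names):
        blocks[blocking_key(name)].append(index)

    pairs = []
    for members in blocks.values():
        # The costly similarity check is limited to records sharing a block.
        for i, j in combinations(members, 2):
            ratio = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
            if ratio >= threshold:
                pairs.append((i, j))
    return pairs


NAMES = ["Jonathan Smith", "Jonathon Smith", "Maria Garcia", "Mario Rossi", "Jon Smyth"]
DUPLICATES = find_duplicates(NAMES)
```

With five names, an exhaustive comparison would need ten pairs, while the two blocks formed here need only four. The trade-off is that true duplicates whose keys differ (for example because of a typo in the first letters) are never compared, so the choice of blocking key matters.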


6.2. Semi Structured and Unstructured Data 

Big data is usually a mix of varied data, which may include semi-structured data, e.g. in
XML/JSON, and unstructured data, e.g. in word-processing files, in e-mail and in text fields in databases.
Data quality problems in semi-structured and unstructured data remain largely unexplored [28, 33].
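
One common first step before quality rules can be applied to semi-structured data is to flatten nested records into a flat set of columns. The Python sketch below shows one possible way to do this for JSON and then lists the columns each record lacks; the dotted-key naming scheme and the sample document are assumptions made for illustration only.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one level with dotted keys."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        # Drop the trailing dot that was added while descending.
        flat[prefix[:-1]] = obj
    return flat


def missing_fields(records):
    """For each record, list the flattened columns that other records have."""
    rows = [flatten(record) for record in records]
    columns = sorted(set().union(*rows))
    return {i: [c for c in columns if c not in row] for i, row in enumerate(rows)}


# Hypothetical semi-structured input.
RAW = """[
  {"id": 1, "contact": {"email": "a@x.org", "phones": ["123"]}},
  {"id": 2, "contact": {"email": null}}
]"""
PARSED = json.loads(RAW)
MISSING = missing_fields(PARSED)
```

Once records share a flat column set, the same completeness and consistency checks used for structured data can be applied, although deeply irregular documents may need a schema-specific mapping instead.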


6.3. User Engagement 

While much research has involved humans in performing deduplication of data sets, for instance
through active learning, involving human experts in other aspects of data cleaning [30], such as gathering
user feedback to determine data quality rules, is still to be explored.


6.4. Rising Privacy and Security Concerns

When cleaning data, the most common task is to observe and examine the complete set of raw data
values. Access to these values may be restricted in some domains, such as telecommunication, medicine and
finance, which is a significant challenge [9]. For example, telecommunication data, such as a log of Internet
connection login sessions collected over an extensive period of time, can reveal an individual’s location and
behavior, as shown in Figure 3.

Figure 3. Information gathered from running analytics on data and files to create Tawsif's profile


6.5. Computational Complexity for Data Streaming

Collecting huge amounts of data from a variety of sensors and user devices is always a challenging
issue. Gartner, Inc. forecast in 2017 that 8.4 billion connected things would be in use worldwide in 2017, up
31% from 2016, and that the number would reach 20.4 billion by 2020 [34]. For this reason, data cleansing
activities may require huge processing power.


6.6. Machine Learning and Other Algorithms 

Lastly, big data analytics is known to be still in the early stages of its development as a
technical discipline. Hence many machine learning algorithms are unable to scale to big data sets or to
tolerate the noise and gaps present in real-world data [35]-[38]. Further research is under way to make these
algorithms more suitable for real-world conditions, in which data cleaning may involve millions or trillions
of items.


6.7. Manual Intervention

Currently, even with the benefit of histograms, conversion tables, rules and algorithms, human
intervention is still required to recognize and repair the data [30], [39].


7. MACHINE LEARNING PARADIGMS FOR BIG DATA CLEANING 

Various learning paradigms are currently available in machine learning, but not every type is
applicable to every field. For instance, [40] presented a cleaning approach using data mining and SVM (a
machine learning paradigm). Machine learning techniques can be used to train a system to complete a task
with minimal human interaction, which may reduce the time and resources required to analyze dirty data and
transform it into usable clean data. Machine learning techniques make a system intelligent by giving it the
capability to learn. Data can be classified in three ways: by unsupervised, supervised and semi-supervised
methods. The selection of an algorithm must depend on the size, quality and nature of the data. Some
common learning algorithms that can be used to clean data are shown in Figure 4.

Figure 4. Machine learning algorithms


7.1. Deep Learning

This technique performs data cleaning by learning data representations rather than relying on
hand-crafted data features. Deep learning algorithms transform data into abstract representations that allow
features to be learned. Hence, there is no need for feature extraction, as the features are learned directly
from the data. Given the nature of big data, the ability to skip the feature extraction step is a great
advantage.


7.2. Naive Bayes Classifier Algorithm 

This algorithm classifies an instance from its attributes and, if the instance has several
attributes, assumes that the attributes are conditionally independent given the class label. The algorithm is
suitable for moderate or large training data sets.
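
As a minimal sketch of how such a classifier could support cleaning, the Python code below trains a categorical Naive Bayes model on complete records and uses it to propose a value for a missing field. The class name, the add-one (Laplace) smoothing, and the city, currency and country sample data are all assumptions made for this illustration; the surveyed works may apply the algorithm differently.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
        self.values = defaultdict(set)            # attribute index -> distinct values seen
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.values[i].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus the sum of log likelihoods: attributes are
            # treated as conditionally independent given the label.
            score = math.log(count / total)
            for i, value in enumerate(row):
                numerator = self.value_counts[(i, label)][value] + 1
                denominator = count + len(self.values[i])
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical complete records: (city, currency) -> country.
ROWS = [("Paris", "EUR"), ("Lyon", "EUR"), ("Berlin", "EUR"),
        ("Munich", "EUR"), ("Boston", "USD"), ("Austin", "USD")]
LABELS = ["FR", "FR", "DE", "DE", "US", "US"]
MODEL = NaiveBayes().fit(ROWS, LABELS)

# Propose a country for a record whose country field is missing.
SUGGESTED = MODEL.predict(("Paris", "EUR"))
```

Thanks to the smoothing term, the model can still score a record that contains an attribute value never seen in training, such as an unknown city, by falling back on the remaining attributes. A proposed value is only a suggestion and would normally be reviewed before it replaces missing data.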


7.3. K-Means Clustering Machine Learning Algorithm 
K-Means produces tighter clusters than hierarchical clustering when the clusters are globular. For a
large number of variables, K-Means clustering also executes faster than hierarchical clustering.
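
The following Python sketch shows one way K-Means could be used during cleaning: numeric readings are clustered, and points that end up in very small clusters are flagged as suspect values for review. The farthest-first initialisation (chosen here only to keep the example deterministic), the minimum cluster size and the sample readings are assumptions of this sketch rather than part of any surveyed method.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with a deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))

    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def suspect_points(labels, min_size=2):
    """Indices of points whose cluster has fewer than min_size members."""
    sizes = Counter(labels)
    return [i for i, label in enumerate(labels) if sizes[label] < min_size]


# Hypothetical readings: two dense groups and one isolated value.
READINGS = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
            (20.0, 20.0)]
CLUSTER_LABELS, CENTROIDS = kmeans(READINGS, k=3)
SUSPECTS = suspect_points(CLUSTER_LABELS)
```

Because the result depends on the chosen number of clusters and on the initial centroids, a flagged point is a candidate for inspection rather than a confirmed error.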


7.4. Apriori Algorithm 
The Apriori algorithm is easy to implement and can be parallelized easily. Its implementation relies
on the properties of large (frequent) item sets.
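
The property in question is that every subset of a frequent item set must itself be frequent, which lets the algorithm discard most candidates early. The Python sketch below implements this level-wise search and applies it to records encoded as "field=value" items, so that frequent combinations can serve as candidate consistency rules. The encoding, the support threshold and the sample records are assumptions made for this illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return every frequent item set with its support (fraction of transactions)."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / total

    items = {item for t in transactions for item in t}
    level = {frozenset([item]) for item in items}
    level = {s for s in level if support(s) >= min_support}

    frequent = {}
    size = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        size += 1
        # Join step: combine frequent sets into candidates one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: keep a candidate only if all its subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, size - 1))}
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical records encoded as sets of "field=value" items.
ENCODED = [
    {"city=Paris", "country=FR"},
    {"city=Paris", "country=FR"},
    {"city=Paris", "country=FR"},
    {"city=Berlin", "country=DE"},
    {"city=Berlin", "country=DE"},
    {"city=Paris", "country=DE"},  # rare combination, possibly an entry error
]
FREQUENT = apriori(ENCODED, min_support=0.3)
```

In this sample the pair of Paris and FR is frequent while the pair of Paris and DE is not, so the last record would be a candidate for review. Rare but valid combinations would be flagged as well, which is why such rules usually need confirmation by a domain expert.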


7.5. Random Forest Machine Learning Algorithms 

Random Forest is quite robust to noise, which makes it efficient and versatile for classification
and regression tasks. It is easy to choose which parameters to use, since the algorithm is not sensitive to
the parameters it is run with. The trees can be grown in parallel, and the algorithm is efficient for large
databases while achieving high classification accuracy.


8. CONCLUSION 

In recent years, big data processing has brought about perhaps the greatest revolution in
computing. The cleaning of massive volumes of data lies at the heart of big data analytics in all domains,
enabling better data investigation.

In this paper, an overview was presented to identify the potential of data cleaning in big data
analytics in the process of gathering, arranging and processing information. It is important to understand
the data quality criteria of dirty data in order to clean data sets without failure. A comparison of
commercial tools was presented, based on comments obtained from different customers. Most of the tools
are mainly concerned with organizing data sets and cleaning messy data, and very few use machine learning.
Moreover, they give little weight to big data characteristics, which may pose a big challenge when cleaning
data. Many data repairing algorithms are available, but a human expert is still required to judge whether
the cleaning process is correct. Machine learning algorithms will probably take over many jobs; with the
fast evolution of big data and the accessibility of programming tools like Python and R, machine learning is
becoming increasingly mainstream among data scientists. Machine learning applications are highly automated
and self-modifying, and they continue to improve over time with minimal human intervention as they learn
from more data.

This survey has prompted us to conduct additional real-world evaluations and to develop a modified
framework for big data analytics by changing the structure of the cleaning phase to obtain clearer insights
from data. We expect to produce a new plan for the structure of data quality techniques that can be more
efficient in big data analytics.